Adam Loy
Carleton College
ICSA 2022 Applied Statistics Symposium
Data wrangling
Rendering graphics
Conditional execution
Iteration
Writing functions
Web scraping
Developing logical solution to a problem
Planning a large/complex software project
Modularization
Profiling
Unit testing
Focus: statistical content
Data acquisition
Data exploration
Inference
Communication
Focus: problem solving and technical skills acquisition
Data science
Computational statistics
Capstones
All statistics majors should be exposed to this set of ideas within the curriculum
We could design whole courses around this idea, but requires more resources and curricular redesign
More realistic to add a module/topic/assignment exploring one skill
Modules can be added to courses with less “disruption”
Example: adding modularization
the process of identifying and creating a set of reliable helper functions to support a complex computational goal
helps students think about the solution before they attempt to code it
makes code easier to read and debug
robust – unit tests are easier to write
Who?
What?
When?
Where?
How?
Upper-division statistics students; intro data science students
Modularize familiar code (i.e., recipes from other courses)
In a statistics computing course or capstone, data science course
Either in class or as homework
Start with a familiar implementation, and provide scaffolded path to create a modularized function
Carefully read through the code, noting what each line of code does.
Create a list of small/simple tasks that are implemented in the above code chunk.
Discuss this your answer to #2 with your group. Settle on a list of small/simple tasks that are implemented and record these on the shared Google doc
Are any of the simple tasks already implemented in an R function? If so, which ones?
Write an R function for each simple task identified. These functions are your set of helper functions.
resample <- function(obs_sample, B) {
stats <- numeric(B)
for(i in 1:B) {
x <- sample(obs_sample, replace = TRUE)
stats[i] <- mean(x)
}
stats
}
plot_hist <- function(stats) {
ggplot(data = NULL, aes(x = stats)) +
geom_histogram()
}
calc_summary <- function(obs_sample, stats) {
data.frame(
observed = mean(obs_sample),
se = sd(stats),
mean = mean(stats),
bias = mean(stats) - mean(obs_sample)
)
}Using your set of helper functions, create a bootstrap() function that takes
obs_sample = observed data vectorB = # of resamplesas inputs, and prints the plot and returns bootstrap summary data frame
Test your function
Here’s a non-modularized version of this function. Compare and contrast the readability to your function.
bootstrap_nonmod <- function(obs_sample, B) {
stats <- numeric(B)
for(i in 1:B) {
x <- sample(obs_sample, replace = TRUE)
stats[i] <- mean(x)
}
p <- ggplot(data = NULL, aes(x = stats)) +
geom_histogram()
print(p)
data.frame(
observed = mean(obs_sample),
se = sd(stats),
mean = mean(stats),
bias = mean(stats) - mean(obs_sample)
)
}Who?
What?
When?
Where?
How?
Intro data science students
Streamline interactivity in a Shiny app
In a data science course
Either in class or as homework
Start with a clunky Shiny app, provide guided exploration to discover and fix issues
Read through the code in the renderPlot() and renderPrint() reactive expressions. Note the key tasks executed within each.
Is there any replication? That is, are any key tasks executed multiple times?
h3("Bootstrap distribution")
renderPlot({
boot <- resample(obs_data, B = input$n_boot)
data.frame(means = boot) %>%
ggplot(aes(x = means)) +
stat_histinterval(breaks = input$nbins, .width = input$conf_level, size = 20) +
theme_minimal() +
labs(x = paste("Mean of flipper length"))
})
h3("Bootstrap percentile interval")
renderPrint({
boot <- resample(obs_data, B = input$n_boot)
alpha <- 1 - input$conf_level
quantiles <- quantile(boot, probs = c(alpha/2, 1 - alpha/2))
c(mean = mean(boot), quantiles)
})render* statements. Are any of these user inputs?conf_level (user input)nbins (user input)n_boot (user input)bootFor each reactive expression you identify in step #5, which output (the histogram or the confidence interval) should observe that expression and update when it is changed?
n_boot → boot → bothconf_level → both (but only interval)nbins → histogramUse your answers to steps #5 & 6 to reorganize the server-side code for this Shiny app.
boot(), to store the bootstrap statisticsrenderPlot() and renderPrint() statements to use the reactive object you just created.renderPlot({
boot <- resample(obs_data, B = input$n_boot)
data.frame(means = boot) %>%
ggplot(aes(x = means)) +
stat_histinterval(breaks = input$nbins, .width = input$conf_level, size = 20) +
theme_minimal() +
labs(x = paste("Mean of flipper length"))
})
renderPrint({
boot <- resample(obs_data, B = input$n_boot)
alpha <- 1 - input$conf_level
quantiles <- quantile(boot, probs = c(alpha/2, 1 - alpha/2))
c(mean = mean(boot), quantiles)
})renderPlot() and renderPrint() statements to use the reactive object you just created.boot <- reactive({
resample(obs_data, B = input$n_boot)
})
renderPlot({
data.frame(means = boot()) %>%
ggplot(aes(x = means)) +
stat_histinterval(breaks = input$nbins, .width = input$conf_level, size = 20) +
theme_minimal() +
labs(x = paste("Mean of flipper length"))
})
renderPrint({
alpha <- 1 - input$conf_level
quantiles <- quantile(boot(), probs = c(alpha/2, 1 - alpha/2))
c(mean = mean(boot()), quantiles)
})Statistics majors need to be able to program for and with data, not simply write data-driven code
For upper-division students, this can be done in traditional curriculum
Reactive programming in Shiny is another way to hone these skills